Learning Tense Translation from Bilingual Corpora

Abstract

This paper studies and evaluates disambiguation strategies for the translation of tense between German and English, using a bilingual corpus of appointment scheduling dialogues. It describes a scheme to detect complex verb predicates based on verb form subcategorization and grammatical knowledge. The extracted verb and tense information is presented and the role of different context factors is discussed.

1 Introduction

A problem for translation is its context dependence. For every ambiguous word, the part of the context relevant for disambiguation must be identified (disambiguation strategy), and every word potentially occurring in this context must be assigned a bias for the translation decision (disambiguation information). Manual construction of disambiguation components is quite a chore. Fortunately, the task can be (partly) automated if the tables associating words with biases are learned from a corpus. Statistical approaches also support empirical evaluation of different disambiguation strategies.

The paper studies disambiguation strategies for tense translation between German and English. The experiments are based on a corpus of appointment scheduling dialogues counting 150,281 German and 154,773 English word tokens aligned in 16,857 turns. The dialogues were recorded, transcribed and translated in the German national Verbmobil project that aims to develop a tri-lingual spoken language translation system. Tense is interesting, since it occurs in nearly every sentence. Tense can be expressed on the surface lexically as well as morphosyntactically (analytic tenses).

* This work was funded by the German Federal Ministry of Education, Science, Research and Technology (BMBF) in the framework of the Verbmobil Project under Grant 01 IV 101 U. Many thanks are due to G. Carroll, M. Emele, U. Heid and the colleagues in Verbmobil.

2 Words Are Not Enough

Often, sentence meaning is not compositional but arises from combinations of words (1).
(1) a. Ich habe ihn gestern gesehen.
       I have him yesterday seen
       I saw him yesterday.
    b. Ich schlage Montag vor.
       I beat Monday forward
       I suggest Monday.
    c. Ich möchte mich beschweren.
       I 'd like to myself weigh down
       I'd like to make a complaint.

For translation, the discontinuous words must be amalgamated into single semantic items. Single words or pairs of lemma and part of speech tag (L-POS pairs) are not appropriate. To verify this claim, we aligned the L-POS pairs of the Verbmobil corpus using the completely language-independent method of Dagan et al. (1993). The results for sehen¹ (see) are listed below in order of frequency, followed by some frequent alignments for reflexive pronouns.

 72  sehen:VVFIN  be:VBZ      (aussehen)
 44  sehen:VVFIN  do:VBP      (do-support)
 39  sehen:VVFIN  have:VBP    (perfect)
 35  sehen:VVFIN  see:VB

176  wir:PRF      meet:VB     (sich treffen)
 33  wir:PRF      we:PP
 30  sich:PRF     spell:VBN   (sich schreiben)
 16  ich:PRF      forward:RP  (sich freuen auf)
 14  wir:PRF      agree:VB    (sich einigen)
 13  ich:PRF      myself:PP

¹ The prefix verb aus-sehen (look, be) is very frequent in the corpus; it often occurs in questions. Present sehen was frequently translated into perfect discover.
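To make the idea of a frequency-ordered table of L-POS pairings concrete, the following is a minimal sketch in Python. It is not the method of Dagan et al. (1993): it only counts how often a source L-POS pair and a target L-POS pair occur in the same aligned turn, which is a much cruder stand-in. The function names and the toy turns are invented for illustration and do not come from the Verbmobil corpus.

```python
from collections import Counter, defaultdict


def cooccurrence_table(aligned_turns):
    """Count, per source L-POS pair, the target L-POS pairs seen in the same turn.

    aligned_turns: iterable of (source_pairs, target_pairs), one entry per turn.
    This is plain co-occurrence counting, an assumed simplification.
    """
    table = defaultdict(Counter)
    for source_pairs, target_pairs in aligned_turns:
        # Use sets so a pair repeated inside one turn is counted once.
        for src in set(source_pairs):
            for tgt in set(target_pairs):
                table[src][tgt] += 1
    return table


def ranked_candidates(table, src):
    """Return (target pair, count) for src, most frequent first, ties alphabetical."""
    return sorted(table[src].items(), key=lambda item: (-item[1], item[0]))


# Toy data, invented for illustration only.
turns = [
    (["sehen:VVFIN", "ich:PPER"], ["see:VB", "I:PP"]),
    (["sehen:VVFIN", "gut:ADJD"], ["be:VBZ", "good:JJ"]),
    (["sehen:VVFIN", "wir:PPER"], ["see:VB", "we:PP"]),
]
table = cooccurrence_table(turns)
top = ranked_candidates(table, "sehen:VVFIN")
```

Even on this toy input, sehen:VVFIN collects several unrelated target pairs, which mirrors the point of the section: counts over single L-POS pairs mix the main verb with auxiliaries and other material from complex predicates.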

Publication date: 2002